In [ ]:
epochs = 15  # number of training epochs (used below as EPOCHS)

Federated Learning - SMS spam prediction with a GRU model

In this tutorial you are going to see how you can leverage PySyft and PyTorch to train a 1-layer GRU model using Federated Learning.

The data used for this project is the SMS Spam Collection Data Set, available on the UCI Machine Learning Repository. The dataset consists of about 5,500 SMS messages, of which around 13% are spam.

The objective here is to simulate two remote machines (which we will call Bob and Anne), where each machine has a similar number of labeled data points (SMS messages labeled as spam or not).

Author: André Macedo Farias. Github: @andrelmfarias | Twitter: @andrelmfarias

I also wrote a blog post about this tutorial and PySyft; feel free to check it out: Private AI — Federated Learning with PySyft and PyTorch

Useful imports


In [ ]:
import numpy as np
from sklearn.metrics import roc_auc_score

import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

import warnings

warnings.filterwarnings("ignore")

Loading data

As we are most interested in the usage of PySyft and Federated Learning, I will skip the text-preprocessing part of the project. If you are interested in how I performed the preprocessing of the raw dataset, you can take a look at the script preprocess.py.

Each data point of the inputs.npy dataset corresponds to an array of 30 tokens obtained from each message (padded at the left or truncated at the right).

The labels.npy dataset has two unique values: 1 for spam and 0 for non-spam.
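
As a hedged illustration of this format (the real logic lives in preprocess.py; the function below and the use of 0 as the padding index are assumptions made only for the example):

In [ ]:
def pad_or_truncate(token_ids, seq_len=30):
    # Hypothetical sketch of the fixed-length encoding described above;
    # the actual preprocessing is implemented in preprocess.py.
    token_ids = token_ids[:seq_len]                      # truncate at the right
    return [0] * (seq_len - len(token_ids)) + token_ids  # pad at the left with 0

pad_or_truncate([12, 7, 431])  # -> 27 zeros followed by [12, 7, 431]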


In [ ]:
inputs = np.load('./data/inputs.npy')
labels = np.load('./data/labels.npy')

In [ ]:
VOCAB_SIZE = int(inputs.max()) + 1  # vocabulary size: largest token id plus one (token ids are assumed to start at 0)

Training model with Federated learning

Training and model hyperparameters


In [ ]:
# Training params
EPOCHS = epochs
CLIP = 5 # gradient clipping - to avoid gradient explosion (frequent in RNNs)
lr = 0.1
BATCH_SIZE = 32

# Model params
EMBEDDING_DIM = 50
HIDDEN_DIM = 10
DROPOUT = 0.2

Initiating virtual workers with PySyft

In this part we are going to split the dataset into training and test sets with an 80/20 ratio. Each of these sets will then be split in two and sent to "Bob's" and "Anne's" machines in order to simulate remote and private data.

Please note that in a real use case, such datasets would already be on the remote machines and the preprocessing would be performed beforehand on their own devices.


In [ ]:
import syft as sy

In [ ]:
labels = torch.tensor(labels)
inputs = torch.tensor(inputs)

# splitting training and test data
pct_test = 0.2

train_labels = labels[:-int(len(labels)*pct_test)]
train_inputs = inputs[:-int(len(labels)*pct_test)]

test_labels = labels[-int(len(labels)*pct_test):]
test_inputs = inputs[-int(len(labels)*pct_test):]

In [ ]:
# Hook that extends the PyTorch library to enable all computations with pointers to tensors sent to other workers
hook = sy.TorchHook(torch)

# Creating 2 virtual workers
bob = sy.VirtualWorker(hook, id="bob")
anne = sy.VirtualWorker(hook, id="anne")

# threshold indexes for dataset split (one half for Bob, other half for Anne)
train_idx = int(len(train_labels)/2)
test_idx = int(len(test_labels)/2)

# Sending toy datasets to virtual workers
bob_train_dataset = sy.BaseDataset(train_inputs[:train_idx], train_labels[:train_idx]).send(bob)
anne_train_dataset = sy.BaseDataset(train_inputs[train_idx:], train_labels[train_idx:]).send(anne)
bob_test_dataset = sy.BaseDataset(test_inputs[:test_idx], test_labels[:test_idx]).send(bob)
anne_test_dataset = sy.BaseDataset(test_inputs[test_idx:], test_labels[test_idx:]).send(anne)

# Creating federated datasets, an extension of the PyTorch TensorDataset class
federated_train_dataset = sy.FederatedDataset([bob_train_dataset, anne_train_dataset])
federated_test_dataset = sy.FederatedDataset([bob_test_dataset, anne_test_dataset])

# Creating federated dataloaders, an extension of the PyTorch DataLoader class
federated_train_loader = sy.FederatedDataLoader(federated_train_dataset, shuffle=True, batch_size=BATCH_SIZE)
federated_test_loader = sy.FederatedDataLoader(federated_test_dataset, shuffle=False, batch_size=BATCH_SIZE)

Creating simple GRU (1-layer) model with sigmoid activation for classification task

For educational purposes, we built a handcrafted GRU with linear layers, whose architecture and code you can check in handcrafted_GRU.py.

As the focus of this notebook is the usage of Federated Learning with PySyft, we do not show the construction of the model here.
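
As a rough, hedged sketch of the idea only: the standard GRU cell equations written with nn.Linear layers. The class below is illustrative and not the exact code of handcrafted_GRU.py, which also contains the embedding, dropout and sigmoid output layers.

In [ ]:
class GRUCellSketch(nn.Module):
    # Illustrative GRU cell built from linear layers (not the tutorial's exact implementation)
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Update gate, reset gate and candidate state, each computed from the
        # concatenation of the current input and the previous hidden state
        self.update = nn.Linear(input_dim + hidden_dim, hidden_dim)
        self.reset = nn.Linear(input_dim + hidden_dim, hidden_dim)
        self.candidate = nn.Linear(input_dim + hidden_dim, hidden_dim)

    def forward(self, x, h):
        combined = torch.cat([x, h], dim=1)
        z = torch.sigmoid(self.update(combined))   # update gate
        r = torch.sigmoid(self.reset(combined))    # reset gate
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde            # new hidden state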


In [ ]:
from handcrafted_GRU import GRU

In [ ]:
# Initiating the model
model = GRU(vocab_size=VOCAB_SIZE, hidden_dim=HIDDEN_DIM, embedding_dim=EMBEDDING_DIM, dropout=DROPOUT)

Training and validation


In [ ]:
# Defining loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=lr)

For each epoch we are going to compute the training and validation losses, as well as the Area Under the ROC Curve (AUC) score, because the target dataset is unbalanced (only 13% of labels are positive).
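
As a quick sanity check of why AUC is the right metric here (toy numbers, not the real dataset): a model that gives every message the same low score would reach about 87% accuracy on such an unbalanced set while being useless, and its AUC stays at chance level.

In [ ]:
# Toy unbalanced labels: ~13% positives, as in the SMS dataset
toy_labels = np.array([0] * 87 + [1] * 13)

# Constant low scores: 87% "accuracy" if thresholded at 0.5, but AUC = 0.5 (chance)
constant_scores = np.full(100, 0.1)
print((constant_scores.round() == toy_labels).mean(), roc_auc_score(toy_labels, constant_scores))

# Scores that rank spam above non-spam give a high AUC regardless of class balance
ranked_scores = np.concatenate([np.random.uniform(0.0, 0.4, 87), np.random.uniform(0.6, 1.0, 13)])
print(roc_auc_score(toy_labels, ranked_scores))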


In [ ]:
for e in range(EPOCHS):
    
    ######### Training ##########
    
    losses = []
    # Batch loop
    for inputs, labels in federated_train_loader:
        # Location of current batch
        worker = inputs.location
        # Initialize hidden state and send it to worker
        h = torch.zeros(BATCH_SIZE, HIDDEN_DIM).send(worker)
        # Send model to current worker
        model.send(worker)
        # Setting accumulated gradients to zero before backward step
        optimizer.zero_grad()
        # Output from the model
        output, _ = model(inputs, h)
        # Calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # Clipping the gradient to avoid explosion
        nn.utils.clip_grad_norm_(model.parameters(), CLIP)
        # Optimization step (weight update)
        optimizer.step() 
        # Get the model back to the local worker
        model.get()
        losses.append(loss.get())
    
    ######## Evaluation ##########
    
    # Model in evaluation mode
    model.eval()

    with torch.no_grad():
        test_preds = []
        test_labels_list = []
        eval_losses = []

        for inputs, labels in federated_test_loader:
            # get current location
            worker = inputs.location
            # Initialize hidden state and send it to worker
            h = torch.zeros(BATCH_SIZE, HIDDEN_DIM).send(worker)
            # Send model to worker
            model.send(worker)
            
            output, _ = model(inputs, h)
            loss = criterion(output.squeeze(), labels.float())
            eval_losses.append(loss.get())
            preds = output.squeeze().get()
            test_preds += list(preds.numpy())
            test_labels_list += list(labels.get().numpy().astype(int))
            # Get the model back to the local worker
            model.get()
        
        score = roc_auc_score(test_labels_list, test_preds)
    
    print("Epoch {}/{}...  \
    AUC: {:.3%}...  \
    Training loss: {:.5f}...  \
    Validation loss: {:.5f}".format(e+1, EPOCHS, score, sum(losses)/len(losses), sum(eval_losses)/len(eval_losses)))
    
    model.train()

Well Done!

Et voilà! You have just trained a model for a real-world application (an SMS spam classifier) using Federated Learning!

Conclusion

You can see that with the PySyft library and its PyTorch extension, you can perform operations on tensor pointers just as you would with the regular PyTorch API.

Thanks to this, you were able to train a spam detector model without having any access to the remote and private data: for each batch, you sent the model to the current remote worker and got it back to the local machine before sending it to the worker holding the next batch.

You can also notice that this federated training did not harm the performance of the model: both losses decreased at each epoch as expected, and the final AUC score on the test data was above 97.5%.

There is, however, one limitation of this method: by getting the model back, we can still gain access to some private information. Let's say Bob had only one SMS on his machine. When we get the model back, we can simply check which embeddings of the model changed, and we will know which tokens (words) were in that SMS.
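
As a hedged sketch of that leak (hypothetical code: it assumes the embedding layer is exposed as model.embedding, which may not match handcrafted_GRU.py): snapshotting the embedding weights before sending the model and diffing them after getting it back reveals which token ids, i.e. which words, were in Bob's single SMS.

In [ ]:
# Hypothetical illustration of the privacy leak described above
before = model.embedding.weight.detach().clone()  # assumed attribute name

# ... send the model to Bob, train on his single SMS, get the model back ...

after = model.embedding.weight.detach()
changed = (after - before).abs().sum(dim=1) > 0   # embedding rows that were updated
leaked_token_ids = torch.nonzero(changed).flatten().tolist()
print(leaked_token_ids)  # token ids (words) present in Bob's message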

In order to address this issue, there are two solutions: Differential Privacy and Secured Multi-Party Computation (SMPC).

Differential Privacy would be used to make sure the model does not leak any private information.

SMPC, which is one kind of Encrypted Computation, in turn allows you to send the model privately, so that the remote workers that hold the data cannot see the weights you are using.
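
As a hedged, tensor-level sketch of the SMPC idea with PySyft (the exact API may vary between PySyft versions, and the extra crypto_provider worker below is an assumption needed for some encrypted operations): the values are split into additive shares, so neither Bob nor Anne can read them on their own.

In [ ]:
# Illustrative only: secret-sharing a tensor between the two workers
crypto_provider = sy.VirtualWorker(hook, id="crypto_provider")

x = torch.tensor([0.1, -0.2, 0.3])

# Fixed-precision encoding followed by additive secret sharing across Bob and Anne
x_shared = x.fix_precision().share(bob, anne, crypto_provider=crypto_provider)

# Computations happen directly on the shares; decoding requires gathering all shares
y = x_shared + x_shared
y.get().float_precision()  # back to plaintext: tensor([0.2, -0.4, 0.6])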

Congratulations!!! - Time to Join the Community!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!

Star PySyft on GitHub

The easiest way to help our community is just by starring the repositories! This helps raise awareness of the cool tools we're building.

Pick our tutorials on GitHub!

We made really nice tutorials to get a better understanding of what Federated and Privacy-Preserving Learning should look like and how we are building the bricks for this to happen.

Join our Slack!

The best way to keep up to date on the latest advancements is to join our community!

Join a Code Project!

The best way to contribute to our community is to become a code contributor! If you want to start "one off" mini-projects, you can go to the PySyft GitHub Issues page and search for issues marked Good First Issue.

If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!